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Abstract — We consider dictionary learning and blind calibra- 
tion for signals and matrices created from a random ensemble. 
We study the mean-squared error in the limit of large signal 
dimension using the replica method and unveil the appearance 
of phase transitions delimiting impossible, possible-but-hard and 
possible inference regions. We also introduce an approximate 
message passing algorithm that asymptotically matches the the- 
oretical performance, and show through numerical tests that it 
performs very well, for the calibration problem, for tractable 
system sizes. 

I. Introduction 

Matrix decomposition Y = FX, where Y is known and one 
seeks F and X, with requirements (e.g. sparsity, probability 
distribution of elements etc.) on the properties of X and F, 
is a generic problem that appears in many applications HI. 
Theoretical limits on when matrix decompositions is possible 
and tractable are still very poorly understood. In this work 
we make a step towards this understanding by determining 
the limits of matrix decomposition when Y is created using 
randomly generated matrices F and X. 

Consider a set of P if -sparse A^-dimensional vectors ("sig- 
nals"), with iid components created from the distribution 
(denoting p = K/N) 



(1) 



where i = l, . . . , N, 1 = 1, ... ,P. For each of the vectors we 
perform M linear measurements, summarized by a M x N 
measurement matrix with iid elements J 7 ^ = F^/yN (this 
rescaling ensures that the measurements are of 0(1)) where 
the F^j are generated from a probability distribution (pp^F^) 
(in the numerical examples we will always consider a Gaussian 
distribution of zero mean and unit variance). We have only 
access to the (noisy) results of these measures, that is, to the 
P vectors yi such that 
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where is a Gaussian additive noise with variance A. 

Is it possible to find both the vectors x° and the matrix 
(dictionary) F° (up to a permutation of N elements and their 
signs)? This is the dictionary learning problem. A related 
situation is when one knows at least a noisy version of the 
matrix F°, defined by F' = (F° + iJfjW) / where W 



is a random matrix with the same statistics as F°, P(F a \F') 
then reads, for each matrix element 



P(F°jF' i )=Af(- 
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Recovering F° and x°, knowing this time F' and the P vectors 
yi is a problem that we shall refer to as blind calibration. It 
becomes equivalent to dictionary learning when r) —> 00. 

Our goal here is to analyse optimal Bayes inference (that 
provides the MMSE (minimal MSE)) where the signal x® t and 
the dictionary F^ are estimated from the marginals of the 
posterior probability 

P(x lh F^ l7 F^) = ^^{P{F^\F' lli ) 
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A. Related works 

There are several algorithm suggested and tested for dictio- 
nary learning, see e.g. Q, Q, J4), J5J- The algorithm we 
derive in this paper is closely related but different to the 
bilinear AMP proposed by |6| as explained in Sec. Ill 



The question of how many samples P are necessary for 
the dictionary to be identifiable has a straightforward lower 
bound MP > N(M + Pp), otherwise there is more unknown 
variables than measurements and hence exact recovery is 
clearly impossible. Several works analyzed what is a sufficient 
number of samples for exact recovery. While early rigorous 
results were able to show learnability from only exponential 
many samples [7|, more recent analysis of convex relaxation 
based approaches shows that O(A^logiV) samples are needed 
for a = 1 [8 1 and polynomially many for a < 1 0. 
Another study of sample complexity for dictionary learning 
for a < 1 establishes a O(A^logiV) bound for the number 
of samples ifTOl . A very recent non-rigorous work suggested 
that P = O(N) samples should be sufficient to identify the 
dictionary [11]. That work was based on the replica analysis of 
the problem, but did not analyze the Bayes-optimal approach. 

Several works also considered blind calibration, where only 
a uncertain version of the matrix F is known, on the other hand 
one has the access to many signals and their measurements 




p=K/N jc=P/N 



Fig. 1. Phase diagram for a = 0.5 and A = 0. For both blind calibration and 
dictionary learning, exact learning is possible by the Bayes optimal approach 
above the full red line it* = q " - . However, such a sampling procedures will 
not be tractable below the spinodal transition n s (r]) shown here in dotted line 
for dictionary learning (?y = oo) and blind calibration. 



such that calibration of the matrix F is possible, see e.g. lfT2l 
and reference therein. Cases when both the signal and the 
dictionary are sparse are also considered in the literature, e.g. 
|[T3l . and our theory can be applied to these as well. 

B. Main results 

The present paper has three main results. First, using the 
replica method [14] we estimate the Bayes optimal MMSE in 
the limit of large signals N — > oo. In particular, we define a= 
M /N, it — P/N and show that for the noiseless case, A = 0, 
exact reconstruction is possible if it > tt* = a /(a — p) and 
a > p. In this regime, it is thus possible to recover the matrix 
and the signal exactly if one can compute the marginals of the 
posterior probability distribution Q. This result is striking, all 
the more because it is independent of 77. 

Computing the marginals of the posterior probability distri- 
bution is an extremely hard problem, all the more when there 
is a phase transition in the problem. We determine the value 
7r s (77) (the spinodal transition) below which iterative sampling 
(using for instance Monte Carlo Markov chains or message 
passing) is believed to be intractable in polynomial time. 

Finally, we introduce an AMP-like message passing algo- 
rithm designed to perform such sampling, and show that it 
performs very well for the calibration problem. However, at 
the moderate size we are able to study numerically, finite- 
size deviation from the asymptotic behavior become large as 
7] grows and prevents our algorithm to function nicely in the 
dictionary learning limit. However, we believe that this still 
sets a very promising stage for new algorithmic development 
for dictionary learning. 

II. Asymptotic analysis with the replica method 

A. Replica analysis for matrix decomposition 

The MMSE obtained from the Bayes-optimal approach can 
be computed exactly in the limit of large N via the replica 
method. Although this method is in general non-rigorous, it 
is sometimes possibles to prove that the results derived from 
it are exact., We shall leave out details of the derivation and 
refer instead to lfT31 . Ifl6ll for a very similar computation in the 



Fig. 2. MMSE D (for the matrix F) and E (for the signal x) corresponding 
to p = 0.2, a = 0.5, A = 0, for three values of r\. The MMSE jumps 
abruptly from a finite value to zero at 7r* . However, sampling should remain 
intractable until the spinodal transition arises at a larger value 7r s (r?) and we 
denote the corresponding MSE in dotted line. The figure remains qualitatively 
the same for small value of the additive noise A > 0, where the MMSE at 
large 7r is not zero but rather O(A). If A is large enough, however, the sharp 
transition disappears and the MMSE is continuous (see e.g. Fig. [4j. 



case of compressed sensing. We estimate the Bayes optimal 
MMSE by computing the marginals of the matrix and signals 
elements. Our computation is also very similar to the one of 
ifTTl . who however did not analyze the Bayes optimal MMSE. 

We now compute $ = E^o^'^o (log Z(F°, F', x°))/NP, 
where Z, the so-called partition sum (or evidence), is the 
normalization in eq. Q for a given instance of the problem. 
In order to do so we compute Kpo f x o(Z n ) and then use the 
replica trick logZ = \im n ^ (Z n — We use the replica 

symmetric ansatz which is correct for inference problems with 
prior distributions corresponding to the generative process. The 
final result is that in the large signal limit, N — > 00, the MMSE 
D (on the matrix) and E (on the signals) are given by 

MMSE(a, tt, p, A, 77) = argmax$(£:, D) , (5) 

E ,D 

where the so-called "potential" is given by 

Hm =-% log (A + E + D(p - E))- A+ a E ^_ E) 
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Here denotes an average of a function / of the 

random variable u with distribution Q(u), T>z a Gaussian 
measure with zero mean and unit variance. The probability 
distributions P(x), and P(F\F' ) are the single-element dis- 
tributions introduced in eqs. ([TJ and ([3J. Finally we denoted 

a(l-D) w(p-E) 



E + pD - ED' 



A + E + pD - ED 
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Note that the present expression for the potential is very 
general and can be used to study many similar problems, 
such as matrix completion, or sparse matrix decomposition, 
by changing the distribution of the matrix and of the signal. 



B. Gaussian matrix and Gauss-Bernoulli signal 

When the matrix elements are generated from a Gaus- 
sian with zero mean and unit variance, and x% from Gauss- 
Bernoulli distribution, the potential simplifies to 

a(A + p) 



Signal variables 



Matrix variables 



D)=--log (A+E+D(p - E))- 
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The Bayes-optimal MMSE is obtained by maximizing 
$>(E,D). Analyzing the above expression in the zero-noise 
limit (A = 0) allows to demonstrate our first main result: in 
both the blind calibration and dictionary learning problems, 
the global maximum is given by D = E = 0, with $ — > oo, as 
long as a > p and tt > it* = a /(a — p). Hence for it > ir* it 
is possible to learn the matrix and the signal exactly from the 
measurements. This result is striking since it is independent 
of 77 as long as rj > 0; and coincides with the simple counting 
lower bound. When 77 — > 0, more precisely when 77 <C A, then 
the compressed sensing phase transition — that goes to a = p 
when A — > — is recovered independently of it. 

Bayes-optimal learning, however, requires exact sampling 
from the measure Q, and this remains an extremely hard 
computational problem. In this regard, another important tran- 
sition (the "spinodal"), that can be studied from the form of the 
potential function $(E,D), is the appearance of a high-MSE 
local maxima. This phenomenon marks the downfall of many 
sampling strategy, such as Gibbs sampling or message-passing 
strategy (that performs a steepest ascent in the $>(E, D) func- 
tion). We determine the value 7r s (77) (the so-called "spinodal" 
transition) above which &(E,D) does not have the spurious 
secondary maxima. As shown in Fig. [T] and Fig. [2] it depends 
strongly on the value 77. In fact, when 77 — > 0, we recover 
again the compressed sensing limit we have obtained in IT6ll . 
Figs. [2] and [4] depict the values of MMSE for the signal E and 
the matrix D reached for large systems N — >• 00 by the Bayes 
optimal sampling and by local sampling strategies that reach 
D = E = discontinuously at 7T S . 

The behavior with finite noise A < 77 is also interesting and 
we observe a phenomenology similar to that described in |16| 
in the case of compressed sensing. Because of a two-maxima 
shape of the function <&{E, D) for moderate A, the MMSE 
displays a sharp transition separating a region of parameters 
with a small MMSE, comparable to A, from a region with a 
larger 0(77) MMSE. For larger value of A we do not see any 
abrupt transition, and the MMSE continuously decays with it 
(see e.g. Fig. |4|). 

To conclude, there exist three regions in the phase diagram, 
corresponding to impossible (below 71*), intractable (below 
ir s ), and perhaps-tractable learning. We will now study algo- 
rithmically the later one. 



Factor nodes 




Fig. 3. Factor graph used for the belief propagation inference, here drawn 
using N = 3, P = 2 and M = 2. The factor nodes ensure (in probability) 
the condition =£\ T^xa + 



III. Message-passing algorithm 

To make the Bayes optimal sampling tractable we have to 
resort to approximations. In compressed sensing, the Bayesian 
approach combined with a belief propagation (BP) reconstruc- 
tion algorithm leads to the so-called approximate message 
passing (AMP) algorithm. It was first derived in ifTTll for the 
minimization of l\, and subsequently generalized in [18], [ 19 1. 
We shall now adapt this strategy to the present case. 

A. From BP to AMP 

The factor graph corresponding to the posterior probability 
Q is depicted in Fig. [3] The canonical BP iterative equations 
are written for messages m, n, rh, n and read 



M 



(8) 



n^iiF^) cx P(F Mi |i^) nW/rf^i). (9) 
U'fyk-^l(-F , /*fc) Y[ m jl^/j,i( x jl) i (10) 

k j^i 

f Y-r (yi~Ei -F^ii/V") 2 

n^ni{F Mi ) oc dxji^dF^e 

Y\_n fJ ,k^-^i(F f j,k)[l'niji-y^i(xji) . (11) 

k^i j 

A major simplification of these iterative equations arises 
when one uses the central limit theorem and realizes that only 
the two first moments of the above distributions are important 
for the leading contribution when N — > 00. This "Gaussian" 
approximation is at the basis of approximate message passing 
as used in compressed sensing. The next step of the derivation 
again neglects 0(1/N) terms and allows to reduce the number 
of messages to be iterated from 0(N A ) to 0(N 2 ). This leads 
to a TAP-like set of equations [20|. Finally if the matrix 
F° and signal x° have known distribution of elements this 
further simplifies the algorithm. A full derivation will be given 



elsewhere and we instead refer the reader to the re-derivation 
of AMP in ifTBI where we followed essentially the very same 
steps. 

The final form of our algorithm for blind calibration and 
dictionary learning follows. We denote the mean and variance 
of the BP estimates of marginals over Xu as an and cu, and 
those over T^i as and s^/. We define several auxiliary 
values (all of them are of order 0(1)): 
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Then our AMP algorithm reads 
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where only the following functions are prior-dependent: 
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(22) 
(23) 
(24) 



Initial conditions are set so that the marginals correspond to the 
means and variances of the prior, and cj m ; = y^i . One iteration 
of the algorithm takes 0(N 3 ) steps. In practice, we also damp 
the expressions ( 17|18|19|20[ i to ensure convergence. If the 
matrix elements are not learned, this algorithm reduces to the 
AMP for matrix uncertainty from 11211 . and is asymptotically 



equivalent to the MU-AMP of (221. The bilinear AMP sug- 
gested in [6] for matrix decomposition consists of fixing a 
current value of F and its uncertainty, running MU-AMP, then 
fixing a current values of x and its variances, running MU- 
AMP, and repeating. Whereas the difference in performances 
between the present algorithm and the one of |6| is yet to be 
studied, it is not clear if the later implements asymptotically 
the Bayes optimal inference, and neither if the state evolution 
(next paragraph) applies to it. 

B. State evolution 

The AMP approach is amenable to asymptotic (N — > oo) 
analysis using a method known as "cavity method" in statis- 
tical physics ifPfll . or "state evolution" in compressed sensing 
ifTTl . Given the parameters p, a,n rj, A, the MSE given by 
our approach follows, in the infinite size limit: 

E t+1 = (l-^y^z/^TO^ZyTTQ 

+ p J Vzf c (m x ,zy/ml + rh x ) , (25) 
1 



D t+i = 
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where m F and rh x follow from eq. (jfj}, with E t ,D t on the 
right hand side, E t=0 = p, D t=0 = 1, T>z is a Gaussian 
integral, and f c is defined by eq. ( [22| . 

From eqs. ( |25|26 1, one can show that the evolution of the 
algorithm is equivalent to a steepest ascent of the potential 
4>(E, D) obtained in eq. (j7j). This explains the peculiar mean- 
ing of the spinodal transition arising at tt s and shows that for 
7r > 7r s our algorithm should approximates correctly the Bayes 
optimal inference for matrix decomposition for large systems, 

— > oo, as AMP does for compressed sensing. Note that in 
the later case, the state evolution approach has been proveen 
rigorously [23 1 and would be interesting to see if this could 
be generalize to the present, arguably more complex, case. 

C. Numerical tests 

We have tested our algorithm on instances of tractable 
sizes. The results are shown in Fig. |4]for two noisy cases of 
matrix calibration. The agreement with the theoretical large 
N prediction is excellent for small it. In the large ir region, 
however, we observe finite-size corrections going roughly as 
Tj/N. Despite these finite size effect, the MSE reached by 
the algorithm is excellent in both case. To appreciate the 
performance of the algorithm, note that a l\ -minimization 
would give very poor results even with a perfectly known 
matrix, as the values of a and p are above the Donoho-Tanner 
transition |24|. 

The presence of 0(rj/N) corrections prevents us from using 
our algorithm successfully for large -q. This means that so far 
we are not able to solve efficiently the dictionary learning. 
More work will be needed to reduce these effects. 
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Fig. 4. Comparison of the performance of the AMP algorithm with the 
MMSE D (for the matrix) and E (for the signal, fewer points are shown for 
visibility) for different system sizes N . Top: A case with continuous decay 
of the MMSE for A = 1CT 8 , r) = 10" 4 , a = 0.3, p = 0.1. The initial 
error on the matrix is about 1%. Bottom: A case with a jump in the MMSE 
for A = 10 -8 , r\ = 10 -2 , a = 0.5, p = 0.2, the initial error on the matrix 
is about 10%. As AT increases, the MSE found by the algorithm approaches 
the MMSE computed theoretically. In the large it region, corrections to the 
asymptotic behavior are roughly proportional to r//N. 

IV. Perspectives 

It would be interesting to see if our result for the MMSE and 
the exactness of the state evolution can be proven rigorously, 
as in compressed sensing |25l . l23l . Further, it is important 
to investigate if our algorithm can be improved and the finite 
size effects reduced. One can also generalize our approach to 
other matrix decomposition problems and their applications. 
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